Comparing audio- and a-posteriori-probability-based stream confidence measures for audio-visual speech recognition
Authors
Abstract
When audio and video information are fused for speech recognition, estimating the reliability of the noise-affected audio channel is crucial for obtaining meaningful recognition results. In this paper we compare two types of reliability measures: one based on the statistics of the phoneme a-posteriori probabilities, the other on an analysis of the audio signal itself. From the first type we implemented the entropy and the dispersion of the probabilities; from the audio-based criteria, the so-called Voicing Index. To test the criteria, we used a hybrid ANN/HMM audio-visual recognition system and added 5 different types of noise, at 12 SNR levels each, to the audio signal. For each criterion, we computed the best sigmoidal fit between the fusion parameter and the value of the criterion over all noise types and SNR values. The resulting individual errors and the corresponding averaged relative errors are given.
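As a rough illustration of the posterior-based criterion, the sketch below computes the entropy of a frame's phoneme a-posteriori probabilities and maps it to an audio fusion weight through a sigmoid. It is only a minimal sketch under stated assumptions: the function names, the sigmoid parameterisation (`lower`, `upper`, `slope`, `midpoint`) and the example values are hypothetical and are not taken from the paper, which fits its sigmoids to measured data.

```python
import math


def posterior_entropy(posteriors):
    """Shannon entropy (in nats) of one frame's phoneme posterior distribution.

    Zero-probability entries contribute nothing and are skipped.
    """
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def mean_entropy(frames):
    """Average the per-frame entropy over a sequence of posterior vectors."""
    return sum(posterior_entropy(f) for f in frames) / len(frames)


def sigmoid_fusion_weight(criterion, lower, upper, slope, midpoint):
    """Map a reliability criterion to a fusion weight with a sigmoid.

    All four shape parameters are hypothetical; in practice they would be
    fitted to data. A negative slope makes the weight fall as the
    criterion (here: entropy) rises.
    """
    return lower + (upper - lower) / (1.0 + math.exp(-slope * (criterion - midpoint)))


# Example with made-up posteriors over four phoneme classes.
confident = [0.97, 0.01, 0.01, 0.01]   # peaked: audio looks reliable
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat: audio looks unreliable

# Assumed (not fitted) sigmoid parameters, chosen for illustration only.
params = dict(lower=0.0, upper=1.0, slope=-5.0, midpoint=0.7)

w_confident = sigmoid_fusion_weight(posterior_entropy(confident), **params)
w_uncertain = sigmoid_fusion_weight(posterior_entropy(uncertain), **params)
```

With these assumed parameters, the peaked distribution receives a higher audio weight than the flat one, which is the qualitative behaviour an entropy-based reliability measure is meant to produce.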
Similar resources
Stream confidence estimation for audio-visual speech recognition
We investigate the use of single modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class conditional audio- or visual-only observation probability, raised...
Adaptive Audio-visual Speech Recognition in the Presence of Audio and Video Distortions
Audio-visual speech recognition leads to significant improvements compared to pure audio recognition, especially when the audio signal is corrupted by noise. In this article we investigate the consequences of additional degradations in the video signal on the audio-visual recognition process. We degrade the images with noise, JPEG compression, and errors in the localization of the mouth regio...
Audio-visual Reliability Estimates Using Stream Entropy for Speech Recognition
We present a method for multimodal fusion based on the estimated reliability of each individual modality. Our method uses an information theoretic measure, the entropy derived from the state probability distribution for each stream, as an estimate of reliability. Our application is audio-visual speech recognition. The two modalities, audio and video, are weighted at each time instant according ...
Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons
Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video stream to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic str...
Stream weight optimization of speech and lip image sequence for audio-visual speech recognition
Bimodal speech recognition systems, which use visual information to supplement acoustic information, have been shown to yield better recognition performance than purely acoustic systems, especially when background noise is present. The early integration strategy for HMM-based audio-visual speech recognition is one promising approach, where the output probability is obtained by product of o...